library(readr)
library(dplyr)
library(tidyverse)
library(ROCR)
library(ggplot2)
library(ggridges)
library(plotly)
library(ggbreak)
library(maps)
library(mapdata)
library(ggmap)
library(gapminder)
library(kableExtra)
library(dendextend)
library(tree)
library(maptree)
library(glmnet)
library(randomForest)
library(gbm)
library(neuralnet)
| state | county | candidate | party | total_votes |
|---|---|---|---|---|
| Delaware | Kent | Joe Biden | DEM | 44552 |
| Delaware | Kent | Donald Trump | REP | 41009 |
| Delaware | Kent | Jo Jorgensen | LIB | 1044 |
| Delaware | Kent | Howie Hawkins | GRN | 420 |
| Delaware | New Castle | Joe Biden | DEM | 195034 |
| CountyId | State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Pacific | VotingAgeCitizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Alabama | Autauga County | 55036 | 26899 | 28137 | 2.7 | 75.4 | 18.9 | 0.3 | 0.9 | 0 | 41016 | 55317 | 2838 | 27824 | 2024 | 13.7 | 20.1 | 35.3 | 18.0 | 23.2 | 8.1 | 15.4 | 86.0 | 9.6 | 0.1 | 0.6 | 1.3 | 2.5 | 25.8 | 24112 | 74.1 | 20.2 | 5.6 | 0.1 | 5.2 |
| 1003 | Alabama | Baldwin County | 203360 | 99527 | 103833 | 4.4 | 83.1 | 9.5 | 0.8 | 0.7 | 0 | 155376 | 52562 | 1348 | 29364 | 735 | 11.8 | 16.1 | 35.7 | 18.2 | 25.6 | 9.7 | 10.8 | 84.7 | 7.6 | 0.1 | 0.8 | 1.1 | 5.6 | 27.0 | 89527 | 80.7 | 12.9 | 6.3 | 0.1 | 5.5 |
| 1005 | Alabama | Barbour County | 26201 | 13976 | 12225 | 4.2 | 45.7 | 47.8 | 0.2 | 0.6 | 0 | 20269 | 33368 | 2551 | 17561 | 798 | 27.2 | 44.9 | 25.0 | 16.8 | 22.6 | 11.5 | 24.1 | 83.4 | 11.1 | 0.3 | 2.2 | 1.7 | 1.3 | 23.4 | 8878 | 74.1 | 19.1 | 6.5 | 0.3 | 12.4 |
| 1007 | Alabama | Bibb County | 22580 | 12251 | 10329 | 2.4 | 74.6 | 22.0 | 0.4 | 0.0 | 0 | 17662 | 43404 | 3431 | 20911 | 1889 | 15.2 | 26.6 | 24.4 | 17.6 | 19.7 | 15.9 | 22.4 | 86.4 | 9.5 | 0.7 | 0.3 | 1.7 | 1.5 | 30.0 | 8171 | 76.0 | 17.4 | 6.3 | 0.3 | 8.2 |
| 1009 | Alabama | Blount County | 57667 | 28490 | 29177 | 9.0 | 87.4 | 1.5 | 0.3 | 0.1 | 0 | 42513 | 47412 | 2630 | 22021 | 850 | 15.6 | 25.4 | 28.5 | 12.9 | 23.3 | 15.8 | 19.5 | 86.8 | 10.2 | 0.1 | 0.4 | 0.4 | 2.1 | 35.0 | 21380 | 83.9 | 11.9 | 4.0 | 0.1 | 4.9 |
The election.raw data set contains more counties than the census data set.
| candidate | TOTAL |
|---|---|
| Alyson Kennedy | 6791 |
| Bill Hammons | 6647 |
| Blake Huber | 409 |
| Brian Carroll | 25256 |
| Brock Pierce | 49552 |
## [1] "There are total 38 named presidential candidates in the 2020 election"
### State winner and County winner
| county | state | candidate | party | total_votes | total | pct |
|---|---|---|---|---|---|---|
| Abbeville | South Carolina | Donald Trump | REP | 8215 | 12433 | 0.6607416 |
| Abbot | Maine | Donald Trump | REP | 288 | 417 | 0.6906475 |
| Abington | Massachusetts | Joe Biden | DEM | 5209 | 9660 | 0.5392340 |
| Acadia Parish | Louisiana | Donald Trump | REP | 22596 | 28425 | 0.7949340 |
| Accomack | Virginia | Donald Trump | REP | 9172 | 16962 | 0.5407381 |
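The `total` and `pct` columns in the county winner table can be derived from the raw vote counts. The following is a minimal base R sketch, not the report's actual code: it uses only the four Kent County, Delaware rows listed earlier, and the names `election` and `county_winner` are chosen for illustration.

```r
# Toy data: the four Kent County, Delaware rows shown earlier in the report.
election <- data.frame(
  state       = "Delaware",
  county      = "Kent",
  candidate   = c("Joe Biden", "Donald Trump", "Jo Jorgensen", "Howie Hawkins"),
  party       = c("DEM", "REP", "LIB", "GRN"),
  total_votes = c(44552, 41009, 1044, 420),
  stringsAsFactors = FALSE
)

# Total votes cast in each (state, county) group, repeated on every row.
election$total <- ave(election$total_votes, election$state, election$county,
                      FUN = sum)

# Each candidate's share of the county vote.
election$pct <- election$total_votes / election$total

# Keep the row with the highest share in each county.
groups <- split(election, list(election$state, election$county), drop = TRUE)
county_winner <- do.call(rbind, lapply(groups, function(g) g[which.max(g$pct), ]))
rownames(county_winner) <- NULL
print(county_winner)
```

The state winner table follows the same pattern after summing `total_votes` by state and candidate.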
| state | candidate |
|---|---|
| Alabama | Donald Trump |
| Alaska | Donald Trump |
| Arizona | Joe Biden |
| Arkansas | Donald Trump |
| California | Joe Biden |
| CountyId | State | County | TotalPop | Men | Women | White | VotingAgeCitizen | Income | Poverty | ChildPoverty | Professional | Service | Office | Production | Drive | Carpool | Transit | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | SelfEmployed | FamilyWork | Unemployment | Minority |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | Alabama | Autauga County | 55036 | 48.87528163% | 28137 | 75.4 | 74.52576495% | 55317 | 13.7 | 20.1 | 35.3 | 18.0 | 23.2 | 15.4 | 86.0 | 9.6 | 0.1 | 1.3 | 2.5 | 25.8 | 43.81132350% | 74.1 | 5.6 | 0.1 | 5.2 | Pacific |
| 1003 | Alabama | Baldwin County | 203360 | 48.94128639% | 103833 | 83.1 | 76.40440598% | 52562 | 11.8 | 16.1 | 35.7 | 18.2 | 25.6 | 10.8 | 84.7 | 7.6 | 0.1 | 1.1 | 5.6 | 27.0 | 44.02389851% | 80.7 | 6.3 | 0.1 | 5.5 | Pacific |
| 1005 | Alabama | Barbour County | 26201 | 53.34147552% | 12225 | 45.7 | 77.35964276% | 33368 | 27.2 | 44.9 | 25.0 | 16.8 | 22.6 | 24.1 | 83.4 | 11.1 | 0.3 | 1.7 | 1.3 | 23.4 | 33.88420289% | 74.1 | 6.5 | 0.3 | 12.4 | Pacific |
| 1007 | Alabama | Bibb County | 22580 | 54.25597874% | 10329 | 74.6 | 78.21966342% | 43404 | 15.2 | 26.6 | 24.4 | 17.6 | 19.7 | 22.4 | 86.4 | 9.5 | 0.7 | 1.7 | 1.5 | 30.0 | 36.18689105% | 76.0 | 6.3 | 0.3 | 8.2 | Asian |
| 1009 | Alabama | Blount County | 57667 | 49.40433870% | 29177 | 87.4 | 73.72153918% | 47412 | 15.6 | 25.4 | 28.5 | 12.9 | 23.3 | 19.5 | 86.8 | 10.2 | 0.1 | 0.4 | 2.1 | 35.0 | 37.07493020% | 83.9 | 4.0 | 0.1 | 4.9 | Pacific |
Because the features are on very different scales (for example, TotalPop is in the tens of thousands while most other columns are percentages), I centered and scaled them before running PCA. I removed ‘Minority’, which is a character column, and converted ‘Men’, ‘VotingAgeCitizen’, and ‘Employed’ from percentage strings to numbers so that they could be included.
## ChildPoverty Poverty Employed
## 0.3884449 0.3833092 0.3624495
The three features with the largest absolute loadings on the first principal component are ChildPoverty, Poverty, and Employed.
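The preprocessing and the PC1 loading ranking described above can be sketched as follows. This is an illustration on simulated data, not the report's code: `census_toy` and its values are made up, and only the column names are borrowed from the census table.

```r
set.seed(1)

# Hypothetical stand-in for the census table: numeric columns on very
# different scales plus one percentage column stored as text.
n <- 50
census_toy <- data.frame(
  TotalPop = round(runif(n, 5e3, 5e5)),
  Poverty  = runif(n, 5, 30),
  Income   = runif(n, 3e4, 9e4),
  Men      = sprintf("%.2f%%", runif(n, 45, 55)),
  stringsAsFactors = FALSE
)

# Convert "48.88%"-style strings to plain numbers.
census_toy$Men <- as.numeric(sub("%", "", census_toy$Men, fixed = TRUE))

# PCA on centered, unit-variance features.
pc <- prcomp(census_toy, center = TRUE, scale. = TRUE)

# Loadings of the first principal component, ranked by absolute value.
pc1 <- pc$rotation[, 1]
print(sort(abs(pc1), decreasing = TRUE))
```

Without `scale. = TRUE`, a large-magnitude column such as TotalPop would dominate the first component simply because of its units.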
## OtherTransp PrivateWork VotingAgeCitizen
## 0.001375716 0.049135190 0.050048362
The features shown above, OtherTransp, PrivateWork, and VotingAgeCitizen, have the smallest absolute loadings on the first principal component (all about 0.05 or less). They contribute very little to PC1, which means they are only weakly correlated with the direction of greatest variance in the data.
We need about 10 principal components to capture 90% of the total variance.
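The number of components needed for 90% of the variance comes from the cumulative proportion of variance explained. A minimal sketch on simulated data (the matrix `x` is hypothetical, so the resulting count will not match the report's 10):

```r
set.seed(2)

# Hypothetical feature matrix with two strongly correlated columns.
x <- matrix(rnorm(200 * 6), ncol = 6)
x[, 2] <- x[, 1] + 0.1 * rnorm(200)

pc <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component, and its running total.
pve <- pc$sdev^2 / sum(pc$sdev^2)
cum_pve <- cumsum(pve)

# Smallest number of components whose cumulative PVE reaches 90%.
n_pc_90 <- which(cum_pve >= 0.90)[1]
print(round(cum_pve, 3))
print(n_pc_90)
```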
Applying clustering method to the data set
## clus.10
## 1 2 3 4 5 6 7 8 9 10
## 2131 128 892 6 14 1 11 7 25 4
## clus.10_pc_First_Twp
## 1 2 3 4 5 6 7 8 9 10
## 3036 6 1 114 19 3 20 1 15 4
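Both clusterings are highly unbalanced: each has one dominant cluster and several clusters with fewer than ten counties, so cluster sizes alone do not clearly favor either one. The clustering on the first two principal components seems more appropriate for Santa Barbara County because it places the county in a cluster with multiple other California counties.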
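The two cluster-size tables above could be produced along the following lines. This sketch assumes hierarchical clustering with complete linkage cut into 10 clusters, which the report does not state explicitly; the data are simulated and the object names are illustrative.

```r
set.seed(3)

# Hypothetical scaled feature matrix (100 observations, 5 features).
x <- scale(matrix(rnorm(100 * 5), ncol = 5))

# Clustering on all features.
hc_full <- hclust(dist(x), method = "complete")
clus_full <- cutree(hc_full, k = 10)

# Clustering on the first two principal component scores only.
pc <- prcomp(x)
hc_pc2 <- hclust(dist(pc$x[, 1:2]), method = "complete")
clus_pc2 <- cutree(hc_pc2, k = 10)

# Cluster sizes under each approach.
print(table(clus_full))
print(table(clus_pc2))

# Observations that share a cluster with observation 1 in each clustering;
# the same lookup works for a specific county's row.
same_full <- which(clus_full == clus_full[1])
same_pc2 <- which(clus_pc2 == clus_pc2[1])
```

Comparing `same_full` and `same_pc2` for one row is how a single county's cluster membership can be checked under both clusterings.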
We exclude the predictor ‘party’ from election.c1 for two reasons. First, party is a character column. More importantly, we are predicting which candidate won each county or state from census and population data, and party describes the candidate rather than the county: it identifies the winner directly, so it is not the kind of information we want the model to learn from.
| | train.error | test.error |
|---|---|---|
| tree | 0.0775087 | 0.0856354 |
| logistic | NA | NA |
| lasso | NA | NA |
The test error (0.0856) is somewhat higher than the training error (0.0775), which suggests the tree may be overfitting slightly; a single decision tree may not be the best algorithm for this problem.
According to the graph, the decision tree splits first on Transit and then on White. Further down, the splits depend on SelfEmployed or Professional, and then on TotalPop or Production.
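The train and test errors in the table are misclassification rates. A minimal sketch of how such a table can be filled in, assuming a helper named `calc_error_rate` and a `records` matrix (both names are illustrative) and using made-up labels:

```r
# Misclassification rate: share of predictions that differ from the truth.
calc_error_rate <- function(predicted, actual) {
  mean(predicted != actual)
}

# Empty results table, one row per method, filled in as models are fitted.
records <- matrix(NA_real_, nrow = 3, ncol = 2,
                  dimnames = list(c("tree", "logistic", "lasso"),
                                  c("train.error", "test.error")))

# Hypothetical labels for four test counties.
actual    <- c("Donald Trump", "Joe Biden", "Joe Biden", "Donald Trump")
predicted <- c("Donald Trump", "Donald Trump", "Joe Biden", "Donald Trump")

records["tree", "test.error"] <- calc_error_rate(predicted, actual)
print(records)
```

Rows that have not been filled in yet stay `NA`, which matches how the table appears at this point in the report.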
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = candidate ~ ., family = binomial, data = election.tr)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7362 -0.2479 -0.0851 -0.0104 3.8603
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.598e+00 6.424e+00 -1.183 0.236872
## TotalPop -9.395e-06 4.041e-05 -0.232 0.816167
## Men -2.872e-04 3.162e-04 -0.908 0.363709
## Women 2.197e-05 7.976e-05 0.275 0.783028
## White -1.348e-01 1.301e-02 -10.358 < 2e-16 ***
## VotingAgeCitizen 2.299e-03 3.702e-04 6.211 5.25e-10 ***
## Income -7.479e-06 2.131e-05 -0.351 0.725670
## Poverty -1.532e-02 5.297e-02 -0.289 0.772355
## ChildPoverty 2.152e-02 3.286e-02 0.655 0.512459
## Professional 3.036e-01 5.104e-02 5.948 2.71e-09 ***
## Service 3.045e-01 6.085e-02 5.004 5.62e-07 ***
## Office 2.012e-01 6.347e-02 3.170 0.001524 **
## Production 2.005e-01 5.179e-02 3.871 0.000108 ***
## Drive -2.011e-01 4.799e-02 -4.191 2.78e-05 ***
## Carpool -2.118e-01 6.532e-02 -3.242 0.001185 **
## Transit 2.652e-01 1.211e-01 2.191 0.028485 *
## OtherTransp 1.750e-02 1.225e-01 0.143 0.886382
## WorkAtHome -1.105e-01 7.368e-02 -1.500 0.133514
## MeanCommute 6.144e-03 3.168e-02 0.194 0.846227
## Employed 3.054e-03 5.052e-04 6.045 1.50e-09 ***
## PrivateWork 3.075e-02 2.647e-02 1.162 0.245339
## SelfEmployed 9.804e-05 5.703e-02 0.002 0.998628
## FamilyWork -1.813e+00 6.984e-01 -2.596 0.009442 **
## Unemployment 1.971e-01 5.340e-02 3.692 0.000223 ***
## MinorityBlack 1.245e+00 1.340e+00 0.929 0.353022
## MinorityHispanic -9.237e+00 6.357e+02 -0.015 0.988406
## MinorityNative 2.375e+00 1.127e+00 2.108 0.035011 *
## MinorityPacific 2.167e+00 1.122e+00 1.932 0.053386 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1497.31 on 1444 degrees of freedom
## Residual deviance: 498.71 on 1417 degrees of freedom
## AIC: 554.71
##
## Number of Fisher Scoring iterations: 14
## (Intercept) TotalPop Men Women
## -7.598256 -0.000009 -0.000287 0.000022
## White VotingAgeCitizen Income Poverty
## -0.134788 0.002299 -0.000007 -0.015325
## ChildPoverty Professional Service Office
## 0.021523 0.303613 0.304483 0.201189
## Production Drive Carpool Transit
## 0.200494 -0.201114 -0.211797 0.265177
## OtherTransp WorkAtHome MeanCommute Employed
## 0.017500 -0.110546 0.006144 0.003054
## PrivateWork SelfEmployed FamilyWork Unemployment
## 0.030753 0.000098 -1.812870 0.197134
## MinorityBlack MinorityHispanic MinorityNative MinorityPacific
## 1.244683 -9.237465 2.375326 2.167049
| | train.error | test.error |
|---|---|---|
| tree | 0.0775087 | 0.0856354 |
| logistic | 0.0643599 | 0.0911602 |
| lasso | NA | NA |
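Several variables are significant at the 0.001 level in this model, including White, VotingAgeCitizen, Professional, Service, Production, Drive, Employed, and Unemployment. This differs from the decision tree, whose first splits were on Transit and White. The coefficient for Professional is about 0.30: holding the other variables fixed, a one-unit increase in Professional raises the log-odds of the modeled outcome by about 0.30, which multiplies the odds by about exp(0.30) ≈ 1.35. This is among the largest coefficients for the percentage-valued predictors.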
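The odds interpretation above can be checked with a small example. This is a sketch on simulated data, not the report's model: the data frame `toy` and the coefficients used to simulate it are made up. With a two-level factor response, `glm` with `family = binomial` models the probability of the second level.

```r
set.seed(4)
n <- 200

# Hypothetical county-level predictors.
toy <- data.frame(Professional = runif(n, 20, 45),
                  White        = runif(n, 40, 95))

# Simulated outcome whose log-odds rise with Professional and fall with White.
eta <- -1 + 0.30 * (toy$Professional - 30) - 0.10 * (toy$White - 70)
toy$candidate <- factor(
  ifelse(runif(n) < plogis(eta), "Joe Biden", "Donald Trump"),
  levels = c("Donald Trump", "Joe Biden")
)

fit <- glm(candidate ~ ., data = toy, family = binomial)

# exp(coefficient) is the multiplicative change in the odds of the second
# factor level for a one-unit increase in that predictor.
odds_ratio <- exp(coef(fit))
print(round(odds_ratio, 3))

# Classify with a 0.5 threshold and compute the training error.
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, "Joe Biden", "Donald Trump")
train_error <- mean(pred != toy$candidate)
print(train_error)
```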
## 25 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -4.747366
## (Intercept) .
## TotalPop 0.000001
## Men -0.000367
## Women 0.000002
## White -0.113226
## VotingAgeCitizen 0.002067
## Income 0.000004
## Poverty 0.026369
## ChildPoverty 0.008004
## Professional 0.220289
## Service 0.216087
## Office 0.140080
## Production 0.115647
## Drive -0.143360
## Carpool -0.149760
## Transit 0.235636
## OtherTransp 0.065751
## WorkAtHome -0.049878
## MeanCommute -0.001856
## Employed 0.002587
## PrivateWork 0.026405
## SelfEmployed -0.048529
## FamilyWork -1.351018
## Unemployment 0.171172
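The optimal value for \(\lambda\) is 0.0013. At this small penalty, every predictor coefficient in the output above is non-zero; the only zero entry is the duplicated (Intercept) row. Compared with unpenalized logistic regression, the lasso shrinks most of the larger coefficients toward zero, for example Professional from 0.304 to 0.220, White from -0.135 to -0.113, and FamilyWork from -1.813 to -1.351.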
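The second `(Intercept)` row with a `.` in the coefficient output is what typically appears when the design matrix passed to glmnet still contains the intercept column created by `model.matrix`. The sketch below shows how to drop that column; it uses simulated data, and the cross-validated lasso fit only runs if the glmnet package is installed. Whether the report built its matrix this way is an assumption.

```r
set.seed(5)
n <- 200

# Hypothetical predictors and a simulated two-class outcome.
toy <- data.frame(White   = runif(n, 40, 95),
                  Transit = runif(n, 0, 5),
                  Income  = runif(n, 3e4, 9e4))
eta <- -0.08 * (toy$White - 70) + 0.5 * (toy$Transit - 2.5)
toy$candidate <- factor(
  ifelse(runif(n) < plogis(eta), "Joe Biden", "Donald Trump")
)

# model.matrix() adds an "(Intercept)" column of ones as its first column.
x_with_int <- model.matrix(candidate ~ ., data = toy)
print(colnames(x_with_int))

# Drop it: glmnet fits its own intercept.
x <- x_with_int[, -1, drop = FALSE]
print(colnames(x))

if (requireNamespace("glmnet", quietly = TRUE)) {
  # alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation.
  cv_fit <- glmnet::cv.glmnet(x, toy$candidate, family = "binomial", alpha = 1)
  print(cv_fit$lambda.min)
  print(coef(cv_fit, s = "lambda.min"))
}
```

With the intercept column removed, the coefficient matrix has one row per predictor plus a single intercept.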
## [1] "The test error rate is 0.0828729281767956"
## var rel.inf
## Transit Transit 21.7580123
## White White 20.5745261
## TotalPop TotalPop 8.3418241
## Professional Professional 7.1804795
## Women Women 5.2614706
## VotingAgeCitizen VotingAgeCitizen 4.9431347
## Employed Employed 4.4275587
## Unemployment Unemployment 3.0682501
## Men Men 2.8770159
## SelfEmployed SelfEmployed 2.8269815
## Service Service 2.5871555
## Production Production 2.4250571
## Income Income 1.9888523
## ChildPoverty ChildPoverty 1.9331697
## PrivateWork PrivateWork 1.7037428
## MeanCommute MeanCommute 1.3529113
## WorkAtHome WorkAtHome 1.3049564
## Office Office 1.2172617
## OtherTransp OtherTransp 1.1975903
## Drive Drive 1.1621989
## Poverty Poverty 1.0896626
## Carpool Carpool 0.6107129
## FamilyWork FamilyWork 0.1674751
## [1] "The test error rate is 0.0856353591160221"
From the graph we can see the relationship between Employed and Unemployment and the prediction.
Random forest had a slightly lower test error (0.0829) than the single decision tree (0.0856), because averaging many decorrelated trees reduces the variance, and hence the overfitting, of an individual tree. Boosting gives a test error (0.0856) equal to the decision tree's and close to that of logistic regression (0.0912). Its most influential variables are Transit and White, which match the first splits of the decision tree but differ from the leading coefficients in the logistic and lasso regressions.
##
## Call:
## lm(formula = total_votes ~ TotalPop + Men + Women + White + VotingAgeCitizen +
## Income + Poverty + ChildPoverty + Professional + Service +
## Office + Production + Drive + Carpool + Transit + OtherTransp +
## WorkAtHome + MeanCommute + Employed + PrivateWork + SelfEmployed +
## FamilyWork + Unemployment, data = lm.data.tr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -340263 -41276 -28222 -768 1661763
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.387e+05 1.428e+05 2.372 0.017804 *
## TotalPop -4.799e-01 6.538e-01 -0.734 0.463119
## Men -2.049e+00 3.423e+00 -0.599 0.549485
## Women 1.505e+00 1.288e+00 1.169 0.242798
## White 2.827e+02 1.994e+02 1.418 0.156387
## VotingAgeCitizen 4.311e+00 3.899e+00 1.106 0.269070
## Income -3.751e-01 5.096e-01 -0.736 0.461850
## Poverty -1.610e+03 1.317e+03 -1.222 0.221799
## ChildPoverty 7.503e+02 7.397e+02 1.014 0.310601
## Professional -7.225e+02 9.109e+02 -0.793 0.427792
## Service -1.106e+03 1.096e+03 -1.009 0.313191
## Office -3.913e+03 1.178e+03 -3.321 0.000920 ***
## Production -1.990e+02 9.266e+02 -0.215 0.829946
## Drive -8.062e+02 1.238e+03 -0.651 0.515144
## Carpool -1.778e+03 1.569e+03 -1.133 0.257220
## Transit -2.581e+03 1.819e+03 -1.419 0.156103
## OtherTransp -1.934e+03 2.730e+03 -0.708 0.478790
## WorkAtHome -5.575e+01 1.847e+03 -0.030 0.975928
## MeanCommute 1.912e+02 6.075e+02 0.315 0.752962
## Employed 3.363e+00 5.343e+00 0.629 0.529174
## PrivateWork -7.877e+02 6.055e+02 -1.301 0.193534
## SelfEmployed -3.940e+03 1.170e+03 -3.369 0.000776 ***
## FamilyWork -5.037e+03 6.330e+03 -0.796 0.426321
## Unemployment 5.610e+02 1.353e+03 0.415 0.678405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 97860 on 1421 degrees of freedom
## Multiple R-squared: 0.5592, Adjusted R-squared: 0.5521
## F-statistic: 78.39 on 23 and 1421 DF, p-value: < 2.2e-16
## [1] "The MSE is 13473109137.6167"
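The MSE reported above is the mean squared difference between actual and predicted vote totals on held-out data; its square root, roughly 116,000 votes, is easier to read because it is in the same units as `total_votes`. A minimal sketch of the calculation on simulated data (the data frame `d` and the 80/20 split are assumptions for illustration):

```r
set.seed(6)
n <- 300

# Hypothetical county data: votes roughly proportional to population.
d <- data.frame(TotalPop = runif(n, 5e3, 5e5),
                Income   = runif(n, 3e4, 9e4))
d$total_votes <- 0.45 * d$TotalPop + rnorm(n, sd = 1e4)

# Hold out 20% of the rows for testing.
train_idx <- sample(n, round(0.8 * n))
fit <- lm(total_votes ~ ., data = d[train_idx, ])

# Test MSE and RMSE.
pred <- predict(fit, newdata = d[-train_idx, ])
test_mse <- mean((d$total_votes[-train_idx] - pred)^2)
test_rmse <- sqrt(test_mse)
print(test_mse)
print(test_rmse)
```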
| | actual | Donald Trump | Joe Biden | Prediction |
|---|---|---|---|---|
| 6 | Donald Trump | 0.8314220 | 0.1686340 | Donald Trump |
| 11 | Joe Biden | 0.6474821 | 0.3525925 | Donald Trump |
| 14 | Joe Biden | 0.1000002 | 0.9000184 | Joe Biden |
| 16 | Joe Biden | 0.2523672 | 0.7477261 | Joe Biden |
| 27 | Donald Trump | 0.8314220 | 0.1686340 | Donald Trump |
## [1] "The test error rate is 0.213035606517803"
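The Prediction column in the table above is consistent with picking the candidate with the larger predicted probability in each row. The sketch below applies that rule to the five rows shown; that the report used exactly this rule is an assumption, and the error rate here covers only these five rows, not the full test set.

```r
# Predicted probabilities for the five rows shown in the table.
prob <- matrix(
  c(0.8314220, 0.1686340,
    0.6474821, 0.3525925,
    0.1000002, 0.9000184,
    0.2523672, 0.7477261,
    0.8314220, 0.1686340),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("6", "11", "14", "16", "27"),
                  c("Donald Trump", "Joe Biden"))
)
actual <- c("Donald Trump", "Joe Biden", "Joe Biden", "Joe Biden", "Donald Trump")

# For each row, take the column (candidate) with the highest probability.
prediction <- colnames(prob)[max.col(prob, ties.method = "first")]

# Misclassification rate on these five rows.
error_rate <- mean(prediction != actual)
print(data.frame(actual, prediction, row.names = rownames(prob)))
print(error_rate)
```

Row 11 is the one misclassified case among the five: the network assigns about 0.65 to Donald Trump while the actual winner is Joe Biden.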
| | train.error | test.error |
|---|---|---|
| tree | 0.0775087 | 0.0856354 |
| logistic | 0.0643599 | 0.0911602 |
| lasso | NA | NA |
| random forest | NA | 0.0828729 |
| boosting | NA | 0.0856354 |
| neural network | NA | 0.2130356 |